Author: Nisan Dalva
Kaggle

In this competition we'll try to predict the price of houses in Ames, Iowa, using 80 explanatory variables describing many aspects of each house.
First, we'll use EDA to find out which variables are likely to help predict the price.
After that, we'll do some feature engineering to get the most out of our dataset.
To build our model, we'll try some familiar algorithms such as SGD, Linear Regression, Lasso, Ridge and Gradient Boosting Regression, and tune their hyper-parameters. We'll also use a feature-selection algorithm from sklearn (RFE).
Upgrade the plotly version and install the new package sweetviz to visualize datasets.
!pip install --upgrade plotly
!pip install sweetviz
!pip install -U scikit-learn
# import numpy, matplotlib, etc.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from scipy.stats import norm
import math
from tqdm.auto import tqdm
import sweetviz as sw
# sklearn imports
import sklearn
from sklearn import set_config
from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn import linear_model
from sklearn.linear_model import LinearRegression, SGDRegressor, Lasso, ElasticNet, Ridge
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import RFE
from sklearn.ensemble import GradientBoostingRegressor
!wget https://gist.githubusercontent.com/NisanDalva/7d86471c4af68d1b5a1cc87effe853f7/raw/b617a2e0ee4e9220d094424c142fcf86e8147245/train_house_df.csv #train set
!wget https://gist.githubusercontent.com/NisanDalva/37eb2b754cfbeb5df1ca7a0c413beb05/raw/fa930e97152ed33a031f1e29714f0438e2f3a645/test_house_df.csv #test set
house_df = pd.read_csv('train_house_df.csv')
house_df_test = pd.read_csv('test_house_df.csv')
display(house_df.head())
print(house_df.shape)
display(house_df_test.head())
print(house_df_test.shape)
As we can see, we have 1460 samples in the train set and 1459 in the test set.
To explore the data we use house_df (the train set).
NOTE: To impute the missing values we'll use all_df (the train and test sets combined).
Let's see some information about the data:
house_df.info()
So we have 80 features of different types: float64, int64 and object.
We also have a lot of missing values; we will see later how to fill them so that we achieve a better result.
Some statistics of the data.
print('For types `int64` or `float64`:')
display(house_df.describe().T)
display(house_df.describe(include='object').T)
Get the numerical and categorical features.
def determine_dtypes(df):
    num_cols = df.select_dtypes(include=[np.int64, np.float64]).columns
    cat_cols = df.select_dtypes(include=['object', 'bool']).columns
    return num_cols, cat_cols
Let's take a look at the target variable SalePrice.
# Get the fitted parameters used by the function norm
mu, sigma = norm.fit(house_df['SalePrice'])
fig = px.histogram(data_frame=house_df, x='SalePrice')
fig.update_layout(
title_text=f'SalePrice distribution - mu= {mu:.3f}, sigma= {sigma:.3f}',
xaxis_title_text='SalePrice',
yaxis_title_text='Count',
bargap=0.2, # gap between bars of adjacent location coordinates
)
fig.show()
We can see the most common price is between 140,000 and 149,999 dollars.
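A quick way to quantify how far SalePrice is from the fitted normal (not shown in the original notebook) is the sample skewness from scipy; the prices below are made-up stand-ins for house_df['SalePrice']:

```python
import numpy as np
from scipy.stats import skew

# Hypothetical right-skewed prices, standing in for house_df['SalePrice']
prices = np.array([100_000, 120_000, 140_000, 145_000, 150_000,
                   160_000, 180_000, 250_000, 400_000, 755_000])
print(f'skewness: {skew(prices):.2f}')  # clearly positive => right-skewed
```

A skewness near 0 would match the normal fit; a strongly positive value explains the long right tail in the histogram.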
Since we have a lot of features, let's use sweetviz to summarize all our data.
house_report = sw.analyze(house_df)
house_report.show_notebook(layout='vertical')
You can notice a few things:
Features that are roughly normally distributed (such as 1stFlrSF, TotRmsAbvGrd and GarageArea).
Features with a high correlation to the target variable (OverallQual, GrLivArea, GarageArea, GarageCars, etc.).
Features with a really low correlation to the target variable (OverallCond, BsmtFinSF2, LowQualFinSF, BsmtHalfBath, etc.).
Features where a single value covers more than 90% of the samples (Street, Utilities, LandSlope, Condition2, etc.).
So I decided to drop the features dominated by one value (more than 90%), and the variables whose correlation with the target variable is very low (less than 0.1).
# will drop them later
to_drop = ['Street', 'LandContour', 'LandSlope', 'Condition1', 'Condition2',
'BsmtFinType2', 'BsmtFinSF2', 'Heating', 'CentralAir', 'Electrical', 'LowQualFinSF', 'BsmtHalfBath',
'KitchenAbvGr', 'Functional', 'PavedDrive', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence',
'MiscFeature', 'MiscVal', 'MoSold', 'YrSold']
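The "one value covers more than 90%" rule can also be computed instead of eyeballed from the sweetviz report; a minimal sketch on a toy frame (column names borrowed from the dataset, the data itself made up):

```python
import pandas as pd

def dominant_value_ratio(df):
    # Share of the single most common value in each column (NaNs counted too)
    return df.apply(lambda col: col.value_counts(dropna=False).iloc[0] / len(df))

toy = pd.DataFrame({
    'Street':   ['Pave'] * 19 + ['Grvl'],          # 95% one value
    'LotShape': ['Reg', 'IR1', 'IR2', 'IR3'] * 5,  # balanced
})
near_constant = dominant_value_ratio(toy)
print(near_constant[near_constant > 0.9].index.tolist())  # ['Street']
```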
Let's identify the ordinal features.
ordinal_features = ['OverallQual', 'OverallCond', 'LotShape', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl',
'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'HeatingQC',
'BsmtFullBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenQual', 'GarageCars',
'TotRmsAbvGrd', 'Fireplaces', 'FireplaceQu', 'GarageYrBlt', 'GarageFinish', 'GarageQual',
'GarageCond']
def plot_boxplots_by_features(df, feature, all_features):
    nrows = math.ceil(math.sqrt(len(all_features)))
    ncols = math.ceil(len(all_features) / nrows)
    fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*5, nrows*3))
    axes = axes.flatten()
    pbar = tqdm(zip(all_features, axes), total=len(all_features), desc='Calculating box-plots')
    for i, j in pbar:
        pbar.set_description(f'Calculating box-plot {i} by {feature}')
        sortd = df.groupby([i])[feature].median().sort_values(ascending=False)
        sns.boxplot(x=i, y=feature, data=df, palette='plasma', order=sortd.index, ax=j)
        j.tick_params(labelrotation=45)
    plt.tight_layout()
    # remove unused axes
    for i in tqdm(range(len(all_features), nrows*ncols), total=nrows*ncols - len(all_features), desc='Removing unused axes'):
        fig.delaxes(axes[i])
numerical_features, categorical_features = determine_dtypes(house_df)
plot_boxplots_by_features(house_df, 'SalePrice', categorical_features)
Pay attention to three kinds of features:
Features that do not affect the house price much, such as LotShape, LandContour, BsmtExposure, etc.
That is, we can already conclude that there is no high correlation between them and the target variable.
Features that do affect the target variable, like BsmtQual, KitchenQual, PoolQC, MiscFeature, etc.
Features we do not have much information about, like Utilities and Heating.
Since we have a lot of features, to make things more readable let's look only at the highest correlations.
I assume a 'high' correlation means greater than 0.6.
def plot_correlation_greater_than(df, greater_than):
    sns.set(font_scale=1.1)
    corr = np.abs(df.corr())[np.abs(df.corr()) >= greater_than]
    mask = np.triu(np.ones_like(corr, dtype=bool))  # hide the redundant upper triangle
    plt.figure(figsize=(27, 18))
    ax = sns.heatmap(corr, annot=True, fmt='.2f', cmap='Blues', square=False, linewidth=1, cbar=True, vmin=0, vmax=1, mask=mask)
    # format x and y labels
    ax.set_yticklabels(ax.get_yticklabels(), rotation=0, fontsize=12)
    ax.set_xticklabels(ax.get_xticklabels(), rotation=90, fontsize=12)
    plt.show()
def plot_correlation_greater_than_by_feature(df, feature, greater_than):
    corr = np.abs(df.corr())[np.abs(df.corr()) > greater_than][feature].dropna().drop(feature).to_frame()
    ax = sns.heatmap(corr, annot=True, fmt='.2f', cmap='Blues', square=False, linewidth=1, cbar=True, vmin=0, vmax=1)
    plt.show()
plot_correlation_greater_than(house_df, 0.6)
As we can see, 1stFlrSF is highly correlated with TotalBsmtSF (0.82), which means the first floor and the basement probably have about the same size.
GarageArea is highly correlated with GarageCars (0.88), so there is no unused area: the larger the area, the more cars fit in.
GarageYrBlt is highly correlated with YearBuilt (0.83), which means the house and the garage were probably built in the same year, so let's take the maximum of these features.
We can also sum the total bathrooms in the house and the total porch square feet into single features.
def change_features_based_on_corr(df):
    df['GarageArea_mul_cars'] = df['GarageArea'] * df['GarageCars']
    df['TotalSF'] = df['BsmtFinSF1'] + df['BsmtFinSF2'] + df['1stFlrSF'] + df['2ndFlrSF']
    df['TotalBathrooms'] = df['FullBath'] + 0.5 * df['HalfBath'] + df['BsmtFullBath'] + 0.5 * df['BsmtHalfBath']
    df['TotalPorchSF'] = df['OpenPorchSF'] + df['3SsnPorch'] + df['EnclosedPorch'] + df['ScreenPorch'] + df['WoodDeckSF']
    df['YearLstCnst'] = df[['YearBuilt', 'YearRemodAdd']].max(axis=1)
    to_drop.extend(['BsmtFinSF1', '1stFlrSF', '2ndFlrSF', 'BsmtFullBath', 'HalfBath', 'FullBath', 'OpenPorchSF', 'EnclosedPorch', 'WoodDeckSF', 'YearRemodAdd', 'YearBuilt'])
The target variable SalePrice is highly correlated with OverallQual, TotalBsmtSF, 1stFlrSF, GrLivArea, GarageCars and GarageArea.
Let's see the distribution of these features against the target.
def plot_regplots_high_corr_by_feature(df, feature, high_corr):
    corr = np.abs(df.corr())[np.abs(df.corr()) > high_corr]
    corr_indexes = corr.loc[feature].dropna().drop(feature).index
    nrows = math.ceil(math.sqrt(len(corr_indexes)))
    ncols = math.ceil(len(corr_indexes) / nrows)
    fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*5, nrows*3))
    axes = axes.flatten()
    for i, j in zip(corr_indexes, axes):
        sns.regplot(x=i, y=feature, data=df, ax=j, order=3, ci=None, color='r', line_kws={'color': 'black'})
        j.set_title(f'{i} by {feature}')
    # remove unused axes
    for i in range(len(corr_indexes), nrows*ncols):
        fig.delaxes(axes[i])
    plt.tight_layout()
    plt.show()
plot_regplots_high_corr_by_feature(house_df, 'SalePrice', 0.6)
We can see we have some outliers on those features which have high correlation with SalePrice, so we can remove them.
def remove_outliers(df):
    df.drop(df[(df['OverallQual'] < 5) & (df['SalePrice'] > 200000)].index, inplace=True)
    df.drop(df[(df['GrLivArea'] > 4000) & (df['SalePrice'] < 200000)].index, inplace=True)
    df.drop(df[(df['GarageArea'] > 1200) & (df['SalePrice'] < 200000)].index, inplace=True)
    df.drop(df[(df['TotalBsmtSF'] > 3000) & (df['SalePrice'] > 320000)].index, inplace=True)
    df.drop(df[(df['1stFlrSF'] < 3000) & (df['SalePrice'] > 600000)].index, inplace=True)
    df.drop(df[(df['1stFlrSF'] > 3000) & (df['SalePrice'] < 200000)].index, inplace=True)
    df.drop(df[(df['GarageCars'] < 4) & (df['SalePrice'] > 620000)].index, inplace=True)
    df.reset_index(inplace=True, drop=True)
remove_outliers(house_df)
plot_correlation_greater_than_by_feature(house_df, 'SalePrice', 0.6)
We were able to slightly increase the correlation of GarageCars and GarageArea with the target variable,
so removing those outliers seems about right.
NOTE: Although train and test are two different datasets, filling the missing values is done exactly the same way, so we will handle them together in a dataframe called all_df.
all_df = pd.concat([house_df, house_df_test], keys=['train', 'test']).drop(columns='SalePrice')
display(all_df)
print(all_df.shape)
We have 2908 samples in total (after removing the outliers).
Now we need to check if we have NaN values.
There are three different kinds of NaN values:
'' - an empty string
None - Python's empty value
np.NaN - numpy's empty value
Since we don't know what NaN values look like in this dataset, we'll change all empty values to np.NaN.
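This replacement matters because pandas does not treat an empty string as missing; a small check illustrating the point:

```python
import numpy as np
import pandas as pd

s = pd.Series(['Gd', '', None, np.nan])
print(s.isnull().tolist())  # [False, False, True, True] - '' is NOT missing

s = s.replace('', np.nan)
print(s.isnull().tolist())  # [False, True, True, True] - now it is
```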
def change_NaNs(df):
    df.replace('', np.NaN, inplace=True)
    df.fillna(np.NaN, inplace=True)
change_NaNs(all_df)
Let's see which features have NaN values at all.
def check_NaNs_values(df):
    nan_counts = df.isnull().sum().sort_values(ascending=False)
    count = nan_counts[nan_counts != 0]
    percent = (count / len(df)) * 100
    dtypes = df.dtypes[count.index]
    return pd.concat([count, percent, dtypes], axis=1, keys=['count', 'percent', 'dtype'])
check_NaNs_values(all_df)
Let's take a look at this pie chart:
def plot_sunburst_with_NaNs(df, col1, col2):
    # col1 has NaN values; sunburst can't work with np.NaN, so label them 'NaN'
    cp = df.copy()
    cp[col1] = cp[col1].apply(lambda x: 'NaN' if pd.isnull(x) else x)
    fig = px.sunburst(cp, path=[col1, col2], width=500, height=500)
    fig.show()
plot_sunburst_with_NaNs(all_df, 'PoolQC', 'PoolArea')
As we can see, in this dataset we have 2900 unknown values in the PoolQC feature, against 2897 samples with no pool at all (the pool area is zero).
According to the data_description.txt file from Kaggle, if there is no pool, the value of PoolQC is supposed to be NA (which means 'No Pool').
The remaining 3 values we will fill randomly.
Set NA for the correct samples:
all_df.loc[((all_df['PoolQC'].isnull()) & (all_df['PoolArea'] == 0)), 'PoolQC'] = 'NA'
These are the samples we need to fill randomly.
display(all_df[['PoolQC', 'PoolArea']][all_df['PoolQC'].isnull()])
Pay attention that those 3 samples are from the test set, so it won't affect the evaluation.
Define two methods that fill NaN values: the first fills from a specific column's observed distribution, the second from a specific list.
def fill_na_random_pick_column_distribution(df, column_name):
    df_not_null = df[~df[column_name].isnull()]
    df[column_name] = df[column_name].apply(lambda x: np.random.choice(df_not_null[column_name]) if pd.isnull(x) else x)

def fill_na_random_from_list(df, column_name, fill_from):
    df[column_name] = df[column_name].apply(lambda x: np.random.choice(fill_from) if pd.isnull(x) else x)
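Note these fills draw from np.random without a fixed seed, so results can differ between runs. A seeded variant of the same idea (my addition, not the notebook's code) on a toy Series:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed => reproducible fills

s = pd.Series(['Ex', 'Gd', np.nan, 'Gd', np.nan])
observed = s.dropna().to_numpy()
filled = s.apply(lambda x: rng.choice(observed) if pd.isnull(x) else x)
print(filled.isnull().any())  # False - every NaN got an observed value
```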
options = ['Ex', 'Fa', 'Gd']
fill_na_random_from_list(all_df, 'PoolQC', options)
Now we don't have NaN values in this feature.
plot_sunburst_with_NaNs(all_df, 'PoolQC', 'PoolArea')
print(all_df['PoolQC'].isnull().any())
Let's take a look at this pie chart:
plot_sunburst_with_NaNs(all_df, 'MiscFeature', 'MiscVal')
As we can see, in this dataset we have 2814 unknown values in the MiscFeature feature, against 2813 samples with no misc value at all (MiscVal is zero).
According to the data_description.txt file from Kaggle, if there is no misc value, the value of MiscFeature is supposed to be NA (which means 'None').
The remaining value we will fill randomly.
Set NA for the correct samples:
all_df.loc[((all_df['MiscFeature'].isnull()) & (all_df['MiscVal'] == 0)), 'MiscFeature'] = 'NA'
This is the sample we need to fill randomly.
display(all_df[['MiscFeature', 'MiscVal']][all_df['MiscFeature'].isnull()])
Pay attention that this sample is from the test set, so it won't affect the evaluation.
options = ['Elev', 'Gar2', 'Othr', 'Shed', 'TenC']
fill_na_random_from_list(all_df, 'MiscFeature', options)
Now we don't have NaN values in this feature.
plot_sunburst_with_NaNs(all_df, 'MiscFeature', 'MiscVal')
print(all_df['MiscFeature'].isnull().any())
According to the data_description.txt file from Kaggle, if there is no Alley value, the value of Alley is supposed to be NA (which means no alley access at all).
So let's fill all the NaN values with NA.
all_df['Alley'] = all_df['Alley'].fillna('NA')
According to the data_description.txt file from Kaggle, if there is no Fence value, the value of Fence is supposed to be NA (which means no fence at all).
So let's fill all the NaN values with NA.
all_df['Fence'] = all_df['Fence'].fillna('NA')
According to the data_description.txt file from Kaggle, if there is no FireplaceQu value, the value of FireplaceQu is supposed to be NA (which means no fireplace at all).
So let's fill all the NaN values with NA.
all_df['FireplaceQu'] = all_df['FireplaceQu'].fillna('NA')
Since the area of the street connected to a house most likely resembles that of the other houses in its neighborhood,
let's fill in missing values with the median LotFrontage of the neighborhood.
# Group by neighborhood and fill missing values with the median LotFrontage of that neighborhood
all_df["LotFrontage"] = all_df.groupby("Neighborhood")["LotFrontage"].transform(lambda x: x.fillna(x.median()))
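On a toy frame (made-up numbers) the transform works like this - each NaN gets the median of its own neighborhood group:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B', 'B'],
    'LotFrontage':  [60.0, 80.0, np.nan, 30.0, np.nan, 50.0],
})
# transform keeps the original index, so the result aligns row-by-row
df['LotFrontage'] = (df.groupby('Neighborhood')['LotFrontage']
                       .transform(lambda x: x.fillna(x.median())))
print(df['LotFrontage'].tolist())  # [60.0, 80.0, 70.0, 30.0, 40.0, 50.0]
```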
According to the data_description.txt file from Kaggle, if there are no values in GarageType, GarageFinish, GarageQual and GarageCond, those features are supposed to be NA (which means no garage at all).
So let's fill all the NaN values with NA.
for col in ['GarageType', 'GarageFinish', 'GarageQual', 'GarageCond']:
    all_df[col] = all_df[col].fillna('NA')
Since there is no garage when GarageType is NA, there are no cars in such a garage either.
So let's fill all the NaN values with 0.
for col in ['GarageYrBlt', 'GarageArea', 'GarageCars']:
    all_df[col] = all_df[col].fillna(0)
According to the data_description.txt file from Kaggle, if there are no values in BsmtExposure, BsmtCond, BsmtQual, BsmtFinType2 and BsmtFinType1, those features are supposed to be NA (which means no basement at all).
So let's fill all the NaN values with NA.
for col in ['BsmtExposure', 'BsmtCond', 'BsmtQual', 'BsmtFinType2', 'BsmtFinType1']:
    all_df[col] = all_df[col].fillna('NA')
Since there are no values in BsmtExposure, BsmtCond, BsmtQual, BsmtFinType2 or BsmtFinType1, there is no basement at all.
So let's fill all the NaN values with 0.
for col in ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath']:
    all_df[col] = all_df[col].fillna(0)
According to the data_description.txt file from Kaggle, if there is no value in MasVnrType, this feature is supposed to be None (which means no masonry veneer at all).
So let's fill all the NaN values in MasVnrType with None, and all the NaN values in MasVnrArea with 0.
all_df['MasVnrType'] = all_df['MasVnrType'].fillna('None')
all_df['MasVnrArea'] = all_df['MasVnrArea'].fillna(0)
For each of these features there are only one or two NaN values, and there is no instruction in the data_description.txt file.
So let's fill them randomly from each column's distribution.
for col in ['MSZoning', 'Functional', 'Electrical', 'Exterior1st', 'Exterior2nd', 'KitchenQual', 'SaleType']:
    fill_na_random_pick_column_distribution(all_df, col)
Let's take a look at this pie chart:
def create_pie_chart_of_count(df, column_name):
    df_not_null = df[~df[column_name].isnull()]
    fig = px.pie(df_not_null.groupby([column_name]).size().reset_index(name='count'), names=column_name, values='count',
                 width=400, height=400)
    fig.show()
create_pie_chart_of_count(all_df, 'Utilities')
We can see there is only one sample with NoSeWa and the rest are AllPub.
The only NoSeWa sample is from house_df (the train set):
create_pie_chart_of_count(house_df, 'Utilities')
Since the only house with NoSeWa is in the train set, this feature won't help us predict the target on the test set.
So let's remove it.
all_df.drop(labels='Utilities', axis=1, inplace=True)
Now we shouldn't have any missing values left. Let's verify:
check_NaNs_values(all_df)
Now we don't have missing values at all!
print(all_df.isnull().sum().sum())
print(all_df.shape)
Let's create more features based on different approaches.
Create boolean features where 1 means the amenity is available and 0 means it is not.
def create_boolean_features(df):
    df['HasPool'] = df['PoolArea'].apply(lambda x: 1 if x > 0 else 0)
    df['Has2ndFloor'] = df['2ndFlrSF'].apply(lambda x: 1 if x > 0 else 0)
    df['HasGarage'] = df['GarageArea'].apply(lambda x: 1 if x > 0 else 0)
    df['HasBsmt'] = df['TotalBsmtSF'].apply(lambda x: 1 if x > 0 else 0)
    df['HasFireplace'] = df['Fireplaces'].apply(lambda x: 1 if x > 0 else 0)
    to_drop.extend(['Fireplaces'])
Convert some numerical features that are actually categorical.
def set_as_categorical(df):
    df['MSSubClass'] = df['MSSubClass'].apply(str)
    df['OverallCond'] = df['OverallCond'].astype(str)
    df['YrSold'] = df['YrSold'].astype(str)
    df['MoSold'] = df['MoSold'].astype(str)
Create features based on sum of quality and conditions.
def features_based_on_quality_and_conditions(df):
    df['TotalExtQual'] = df['ExterQual'] + df['ExterCond']
    df['TotalBsmQual'] = df['BsmtQual'] + df['BsmtCond']
    df['TotalGrgQual'] = df['GarageQual'] + df['GarageCond']
    df['TotalQual'] = df['OverallQual'] + df['TotalExtQual'] + df['TotalBsmQual'] + df['TotalGrgQual'] + df['KitchenQual'] + df['HeatingQC']
    ordinal_features.extend(['TotalExtQual', 'TotalBsmQual', 'TotalGrgQual', 'TotalQual'])
    to_drop.extend(['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond',
                    'GarageQual', 'GarageCond', 'KitchenQual', 'HeatingQC'])
Let's split the data back and remove the unnecessary variable Id.
house_df_train_not_null = all_df.loc['train'].copy().join(house_df['SalePrice'])
house_df_test_not_null = all_df.loc['test'].copy()
id_train = house_df['Id']
id_test = house_df_test['Id']
house_df_train_not_null.drop(columns='Id', inplace=True)
house_df_test_not_null.drop(columns='Id', inplace=True)
Convert the ordinal features:
def ordinal_encode(df, ord_features):
    oe = OrdinalEncoder(dtype=np.int64)
    oe_df = pd.DataFrame(oe.fit_transform(df.loc[:, ord_features]), columns=ord_features)
    return pd.concat([df.drop(columns=ord_features), oe_df], axis=1)
house_df_train_not_null = ordinal_encode(house_df_train_not_null, ordinal_features)
house_df_test_not_null = ordinal_encode(house_df_test_not_null, ordinal_features)
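One caveat (my observation, not part of the notebook): calling ordinal_encode separately on train and test fits two independent OrdinalEncoders, which only produce consistent codes if both splits contain exactly the same category sets. A sketch of fitting one shared encoder on the union instead (toy data, hypothetical column name):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({'Qual': ['Fa', 'Gd', 'Ex']})
test = pd.DataFrame({'Qual': ['Gd', 'Gd', 'Fa']})

# Fit once on both splits so the category -> code mapping agrees
oe = OrdinalEncoder(dtype=np.int64)
oe.fit(pd.concat([train, test]))  # categories sorted: Ex=0, Fa=1, Gd=2
print(oe.transform(train).ravel().tolist())  # [1, 2, 0]
print(oe.transform(test).ravel().tolist())   # [2, 2, 1]
```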
print(house_df_train_not_null.isnull().sum().sum())
print(house_df_train_not_null.shape)
print('\n')
print(house_df_test_not_null.isnull().sum().sum())
print(house_df_test_not_null.shape)
So let's execute all of our preparation:
def feature_engineering(df):
    change_features_based_on_corr(df)
    create_boolean_features(df)
    set_as_categorical(df)
    features_based_on_quality_and_conditions(df)
    df.drop(columns=to_drop, inplace=True)
feature_engineering(house_df_train_not_null)
feature_engineering(house_df_test_not_null)
print(house_df_train_not_null.isnull().sum().sum())
print(house_df_train_not_null.shape)
print('\n')
print(house_df_test_not_null.isnull().sum().sum())
print(house_df_test_not_null.shape)
Let's start to build our model.
X = house_df_train_not_null.drop(columns='SalePrice')
t = house_df_train_not_null['SalePrice']
# encode a given dataframe
def encode(df):
    numerical_cols, categorical_cols = determine_dtypes(df)
    ct_enc_std = ColumnTransformer([
        ("encoding", OrdinalEncoder(), categorical_cols),
        ("standard", StandardScaler(), numerical_cols)])
    df_encoded = pd.DataFrame(ct_enc_std.fit_transform(df))
    return df_encoded
print(X.isnull().sum().sum())
print(X.shape)
Define a method that calculates the CV score and loss. (Slightly adapted from Aviad's notebook.)
def get_cv_score_and_loss(X, t, model_name, model, k=None, show_score_loss_graphs=False, use_pbar=True):
    scores_losses_df = pd.DataFrame(columns=['fold_id', 'split', 'score', 'loss'])
    if k is not None:
        cv = KFold(n_splits=k, shuffle=True, random_state=1)
    else:
        raise ValueError('you need to specify k in order for the cv to work')
    if use_pbar:
        pbar = tqdm(desc=f'Computing Model {model_name}', total=cv.get_n_splits(X))
    for i, (train_ids, val_ids) in enumerate(cv.split(X)):
        X_train = X.iloc[train_ids]
        t_train = t.iloc[train_ids]
        X_val = X.iloc[val_ids]
        t_val = t.iloc[val_ids]
        model.fit(X_train, t_train)
        y_train = model.predict(X_train)
        y_val = model.predict(X_val)
        scores_losses_df.loc[len(scores_losses_df)] = [i, 'train', model.score(X_train, t_train), mean_squared_error(t_train, y_train, squared=False)]
        scores_losses_df.loc[len(scores_losses_df)] = [i, 'val', model.score(X_val, t_val), mean_squared_error(t_val, y_val, squared=False)]
        if use_pbar:
            pbar.update()
    if use_pbar:
        pbar.close()
    val_scores_losses_df = scores_losses_df[scores_losses_df['split'] == 'val']
    train_scores_losses_df = scores_losses_df[scores_losses_df['split'] == 'train']
    mean_val_score = val_scores_losses_df['score'].mean()
    mean_val_loss = val_scores_losses_df['loss'].mean()
    mean_train_score = train_scores_losses_df['score'].mean()
    mean_train_loss = train_scores_losses_df['loss'].mean()
    fig_score = px.line(scores_losses_df, x='fold_id', y='score', color='split', title=f'Model name: {model_name}, Mean Val Score: {mean_val_score:.2f}, Mean Train Score: {mean_train_score:.2f}')
    fig_loss = px.line(scores_losses_df, x='fold_id', y='loss', color='split', title=f'Model name: {model_name}, Mean Val Loss: {mean_val_loss:.2f}, Mean Train Loss: {mean_train_loss:.2f}')
    if show_score_loss_graphs:
        fig_loss.show()
        fig_score.show()
    return mean_val_score, mean_val_loss, mean_train_score, mean_train_loss, fig_score, fig_loss
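The loss above is RMSE via mean_squared_error(..., squared=False); the same value is np.sqrt of the MSE (newer scikit-learn versions expose it directly as root_mean_squared_error). A tiny check with made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

t_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])
rmse = np.sqrt(mean_squared_error(t_true, y_pred))  # sqrt((100 + 100 + 900) / 3)
print(rmse)
```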
Let's begin by finding the best algorithm for our data.
# all models
hp_models = {
'SGD Regressor': SGDRegressor(random_state=1, ),
'LinearRegression': LinearRegression(),
'Ridge': make_pipeline(RobustScaler(), Ridge()),
'Lasso': make_pipeline(RobustScaler(), Lasso(random_state=1)),
'Elastic Net': make_pipeline(RobustScaler(), ElasticNet(random_state=1, alpha=0.05)),
'Gradient Boosting Regressor': GradientBoostingRegressor(n_estimators=100, learning_rate=0.05,
max_depth=4, max_features='sqrt',
min_samples_leaf=15, min_samples_split=10,
loss='huber', random_state =5)
}
%%time
def get_models_score_and_loss(X, t, models, numerical_cols, categorical_cols):
    # a DataFrame to save results
    results = pd.DataFrame(columns=['mean val score', 'mean val loss', 'mean train score', 'mean train loss', 'fig_score', 'fig_loss'],
                           index=models.keys())
    pbar = tqdm(models.items(), total=len(models))
    for model_name, model in pbar:
        pbar.set_description(f'Calculating model {model_name}')
        preprocessor = ColumnTransformer([
            ("onehot", OneHotEncoder(sparse=False, handle_unknown='ignore'), categorical_cols),
            ("standard", StandardScaler(), numerical_cols)
        ])
        pipe = make_pipeline(preprocessor, model)
        val_score, val_loss, train_score, train_loss, fig_score, fig_loss = get_cv_score_and_loss(X, t, model_name, pipe, k=10,
                                                                                                  show_score_loss_graphs=False, use_pbar=False)
        results.loc[model_name] = [val_score, val_loss, train_score, train_loss, fig_score, fig_loss]
    return results
numerical_features, categorical_features = determine_dtypes(X)
res = get_models_score_and_loss(X, t, hp_models, numerical_features, categorical_features)
These warnings are because we have reached the maximum number of iterations in ElasticNet model.
# show all results
display(res.iloc[:, :-2])
# show the graphs for the best one of val score
res.iloc[np.argmax(res['mean val score'])]['fig_score'].show()
res.iloc[np.argmax(res['mean val score'])]['fig_loss'].show()
Pick the best model from the table (relative to the mean validation score).
best_model = hp_models[res.iloc[np.argmax(res['mean val score'])].name]
best_model_name = res.iloc[np.argmax(res['mean val score'])].name
print(best_model_name)
print(best_model)
As we can see, we got the best result with the SGDRegressor algorithm,
so we will keep using it.
Let's see if we can get better results by increasing the polynomial degree.
%%time
# show graph of score and loss by polynomial degree of numerical features
def show_degree_graphs_cv_train(X, t, model_name, model, k=None, p=None, max_degree=10):
    numerical_cols, categorical_cols = determine_dtypes(X)
    val_train_score_loss_df = pd.DataFrame(columns=['degree', 'split', 'score', 'loss'])
    for i in tqdm(range(1, max_degree), desc='Poly Degree'):
        ct_enc_std_poly = ColumnTransformer([
            ("encoding", OneHotEncoder(sparse=False, handle_unknown='ignore'), categorical_cols),
            ("standard_poly", make_pipeline(PolynomialFeatures(degree=i), StandardScaler()), numerical_cols)])
        model_pipe = make_pipeline(ct_enc_std_poly, model)
        val_score, val_loss, train_score, train_loss, fig_score, fig_loss = get_cv_score_and_loss(X, t, model_name, model_pipe, k=k, show_score_loss_graphs=False, use_pbar=False)
        val_train_score_loss_df.loc[len(val_train_score_loss_df)] = [i, 'train', train_score, train_loss]
        val_train_score_loss_df.loc[len(val_train_score_loss_df)] = [i, 'cv', val_score, val_loss]
    fig = px.line(val_train_score_loss_df, x='degree', y='score', color='split')
    fig.show()
    fig = px.line(val_train_score_loss_df, x='degree', y='loss', color='split')
    fig.show()
show_degree_graphs_cv_train(X, t, best_model_name, best_model, k=10 ,max_degree=5)
Since we have a lot of features, squaring them (and beyond) causes a dimensional explosion, which is really bad for our data.
So it's absolutely clear that we will stay with degree=1.
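The explosion is easy to quantify: with the default include_bias, PolynomialFeatures on n numerical columns emits C(n+d, d) output features in total. A sketch with a hypothetical n=30:

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n = 30  # hypothetical number of numerical columns
for degree in (1, 2, 3):
    n_out = PolynomialFeatures(degree=degree).fit_transform(np.zeros((1, n))).shape[1]
    # closed form: C(n + degree, degree), counting the bias term too
    print(degree, n_out, comb(n + degree, degree))
```

For n=30 this grows from 31 features at degree 1 to 496 at degree 2 and 5456 at degree 3, before one-hot columns are even counted.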
We want to choose the best features of our data to get the best prediction results.
We'll use the RFE method from the sklearn package, which applies a Backward Feature Selection algorithm (i.e. we start from the full feature set and remove features until we reach the minimum number of features or the best score).
So let's define a method that calculates the best number of features to select, and shows the appropriate graph.
%%time
def feature_selection(X, t, model_name, model):
    X_encoded = encode(X)
    results_df = pd.DataFrame(columns=['num_of_features', 'mean_val_score', 'mean_val_loss', 'mean_train_score', 'mean_train_loss', 'best_features'])
    pbar = tqdm(range(10, len(X_encoded.columns) + 1), total=len(X_encoded.columns) + 1 - 10)
    for i in pbar:
        pbar.set_description(f'Calculating for {i} features')
        selector = RFE(model, n_features_to_select=i).fit(X_encoded, t)
        mean_val_score, mean_val_loss, mean_train_score, mean_train_loss, fig_score, fig_loss = get_cv_score_and_loss(X_encoded, t, model_name, selector, k=10, use_pbar=False)
        if mean_val_score < 0:
            break
        results_df.loc[len(results_df)] = [i, mean_val_score, mean_val_loss, mean_train_score, mean_train_loss, selector.support_]
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=results_df['num_of_features'], y=results_df['mean_val_score']))
    fig.update_xaxes(title_text='Number of features selected')
    fig.update_yaxes(title_text='Cross validation score')
    fig.show()
    return results_df
results_scores = feature_selection(X, t, best_model_name, best_model)
The algorithm found that after 25 features we reach a negative score, so we stopped it.
display(results_scores.iloc[:, :-1])
Let's determine the best features and score RFE found.
best_features = results_scores.iloc[np.argmax(results_scores['mean_val_score'])]['best_features']
num_of_best_features = results_scores.iloc[np.argmax(results_scores['mean_val_score'])]['num_of_features']
best_score = results_scores['mean_val_score'].max()
X_best_features = X.loc[:, best_features]
print(f'There are {num_of_best_features} features that give a score value of {best_score}.')
print('So our data now looks like:')
display(X_best_features)
So let's try to change some of the hyper-parameters of our best model (which is SGDRegressor).
# %%time
def learning_rate_selection(X, t):
    min_learning_rate = 0.0001
    max_learning_rate = 0.1
    delta = 0.0001
    X_encoded = encode(X)
    results_df = pd.DataFrame(columns=['learning_rate', 'mean_val_score', 'mean_val_loss', 'mean_train_score', 'mean_train_loss'])
    learning_rates = np.arange(min_learning_rate, max_learning_rate + delta, delta)
    pbar = tqdm(learning_rates, total=len(learning_rates))
    for i in pbar:
        pbar.set_description(f'Calculating for learning rate {i}')
        model = SGDRegressor(random_state=1, eta0=i, learning_rate='constant').fit(X_encoded, t)
        mean_val_score, mean_val_loss, mean_train_score, mean_train_loss, fig_score, fig_loss = get_cv_score_and_loss(X_encoded, t, 'SGDRegression', model, k=10, use_pbar=False)
        if mean_val_score < 0:
            break
        results_df.loc[len(results_df)] = [i, mean_val_score, mean_val_loss, mean_train_score, mean_train_loss]
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=results_df['learning_rate'], y=results_df['mean_val_score']))
    fig.update_xaxes(title_text='Learning Rate')
    fig.update_yaxes(title_text='Cross validation score')
    fig.show()
    return results_df
results_learning_rate = learning_rate_selection(X_best_features, t)
The algorithm found that at a learning rate of 0.0044 we reach a negative score, and the score was probably already declining, so we stopped it.
display(results_learning_rate)
Let's determine the best learning rate and score.
best_learning_rate = results_learning_rate.iloc[np.argmax(results_learning_rate['mean_val_score'])]['learning_rate']
best_score_learning_rate = results_learning_rate['mean_val_score'].max()
print(f'The best learning rate is {best_learning_rate}, with a score of {best_score_learning_rate}.')
Every time we ran the model so far we did not exceed the maximum number of iterations (we would get a warning if we did), so we can conclude this hyper-parameter has no effect on the result. We'll keep using the default max_iter of 1000.
%%time
def alpha_selection(X, t):
    min_alpha = 0.001
    max_alpha = 0.1
    delta = 0.001
    X_encoded = encode(X)
    results_df = pd.DataFrame(columns=['alpha', 'mean_val_score', 'mean_val_loss', 'mean_train_score', 'mean_train_loss'])
    alphas = np.arange(min_alpha, max_alpha + delta, delta)
    pbar = tqdm(alphas, total=len(alphas))
    for i in pbar:
        pbar.set_description(f'Calculating for alpha {i}')
        model = SGDRegressor(random_state=1, eta0=best_learning_rate, learning_rate='constant', alpha=i).fit(X_encoded, t)
        mean_val_score, mean_val_loss, mean_train_score, mean_train_loss, fig_score, fig_loss = get_cv_score_and_loss(X_encoded, t, 'SGDRegression', model, k=10, use_pbar=False)
        if mean_val_score < 0:
            break
        results_df.loc[len(results_df)] = [i, mean_val_score, mean_val_loss, mean_train_score, mean_train_loss]
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=results_df['alpha'], y=results_df['mean_val_score']))
    fig.update_xaxes(title_text='Alpha')
    fig.update_yaxes(title_text='Cross validation score')
    fig.show()
    return results_df
results_alpha = alpha_selection(X_best_features, t)
display(results_alpha)
best_alpha = results_alpha.iloc[np.argmax(results_alpha['mean_val_score'])]['alpha']
best_score_alpha = results_alpha['mean_val_score'].max()
print(f'The best alpha is {best_alpha}, with a score of {best_score_alpha}.')
Let's take a look at our final model.
numerical_cols, categorical_cols = determine_dtypes(X_best_features)
final_model = SGDRegressor(random_state=1, eta0=best_learning_rate, learning_rate='constant', alpha=best_alpha)
preprocessor = ColumnTransformer([
("encoding", OrdinalEncoder(), categorical_cols),
("standard", StandardScaler(), numerical_cols)
])
pipe = make_pipeline(preprocessor, final_model)
set_config(display='diagram')
display(pipe)
pipe.fit(X_best_features, t)
house_df_test_best_features = house_df_test_not_null.loc[:, best_features]
preds = pipe.predict(house_df_test_best_features)
submission = pd.DataFrame({'Id': id_test, 'SalePrice': preds})
submission.to_csv('my_submission.csv', index=False)
So in this competition we were able to predict house prices in Ames, Iowa.
We dealt with a lot of different features, some relevant and some not.
We explored the data, and we also applied common sense about what can affect the price and what cannot - and dropped features accordingly.
In my opinion this was the big problem: knowing which features can best predict the prices.
We saw that the Linear Regression model gives very poor results on this data, so we did not even try to tune its parameters.
In addition, in the Elastic Net model we received a warning that we had reached the maximum number of iterations (by default max_iter=1000) - so we did not use it either.
Finally we chose SGD Regressor because it got the highest score value, and tuned its parameters accordingly.
# convert to html file
%%shell
jupyter nbconvert --to html '/content/HW2_ML_Nisan_Dalva.ipynb'